A Survey on Partitioning Skew Diminishing Techniques in Hadoop MapReduce Environment
Abstract
The Big Data era produces large volumes of structured and unstructured data, and MapReduce is an effective tool for processing such data in parallel. Parallel data processing is at the heart of Apache’s Hadoop, an open-source implementation of Google’s MapReduce model. Hadoop has two components: the Hadoop Distributed File System (HDFS), which stores the data, and MapReduce, which processes it. MapReduce has become a popular programming model for building data processing applications. One significant issue in practical MapReduce applications is data skew: an imbalance in the amount of data assigned to each task. Although MapReduce is widely used, existing MapReduce schedulers still suffer from a form of this problem known as partitioning skew, in which the output of map tasks is unevenly distributed among reduce tasks. As a result, some tasks take much longer to finish than others, which can significantly degrade performance. This paper reviews the types of partitioning skew and several skew diminishing techniques that address them.
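As a rough illustration of partitioning skew, the sketch below routes hypothetical map output to reducers by hashing each key and taking the result modulo the number of reducers. This is an illustrative sketch, not code from the paper: the function names, the use of `zlib.crc32` as a stand-in hash, and the sample data are all assumptions.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key by hashing it.

    crc32 is only a deterministic stand-in for the key's hash code
    (an assumption of this sketch, not Hadoop's actual hash function).
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(map_output, num_reducers):
    """Count how many (key, value) records each reducer would receive."""
    loads = [0] * num_reducers
    for key, _value in map_output:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


def skew_ratio(loads):
    """Largest reducer load divided by the mean load (1.0 = perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical map output: one hot key ("the") dominates the record count.
map_output = [("the", 1)] * 900 + [(f"word{i}", 1) for i in range(100)]
loads = reducer_loads(map_output, num_reducers=4)

print("records per reducer:", loads)
print("skew ratio:", skew_ratio(loads))
```

Because every record with the same key must go to the same reducer, the reducer that receives the hot key gets at least 900 of the 1000 records no matter how evenly the remaining keys hash, which is the imbalance the abstract describes.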
Similar resources
Handling Data Skew in MapReduce Cluster by Using Partition Tuning
The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data pro...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
Handling partitioning skew in MapReduce using LEEN
MapReduce is emerging as a prominent tool for big data processing. Locality is a key feature in MapReduce that is extensively leveraged in data-intensive cloud systems: it avoids network saturation when processing large amounts of data by co-allocating computation and data storage, particularly for the map phase. However, our studies with Hadoop, a widely used MapReduce implementation, demonstrate that the presen...
A Study of Skew in MapReduce Applications
This paper presents a study of skew — highly variable task runtimes — in MapReduce applications. We describe various causes and manifestations of skew as observed in real world Hadoop applications. Runtime task distributions from these applications demonstrate the presence and negative impact of skew on performance behavior. We discuss best practices recommended for avoiding such behavior and t...
Handling Data Skew in MapReduce
MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is ...
Publication date: 2017